[RL] Reuse GDR checkpoint transfer handle #8078

Open
jackyYang6 wants to merge 2 commits into PaddlePaddle:develop from jackyYang6:jacky/optimize-checkpoint-transfer-handle-init-develop
@jackyYang6 jackyYang6 commented Jun 25, 2026

Contributor

Motivation

Avoid repeated CheckpointTransfer initialization during GDR dynamic weight updates. Reusing the initialized handle reduces repeated setup overhead across multiple update steps.

Modifications

  • Cache the GDR CheckpointTransfer handle in DynamicWeightManager.
  • Lazily initialize the handle on the first GDR weight update.
  • Reuse the cached handle for later update_weights_by_gdr calls.
  • Destroy and reset the cached handle when an update fails.
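
The lifecycle in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `FakeCheckpointTransfer`, its `receive` method, the `transfer_factory` parameter, and `_get_gdr_handle` are hypothetical stand-ins. Only the names `DynamicWeightManager`, `update_weights_by_gdr`, `_destroy_gdr_handle`, and `cleanup()` appear in this PR's discussion.

```python
class FakeCheckpointTransfer:
    """Hypothetical stand-in for the real CheckpointTransfer; counts constructions."""

    created = 0

    def __init__(self):
        FakeCheckpointTransfer.created += 1
        self.cleaned_up = False

    def receive(self, fail=False):
        # Hypothetical method name: simulates one weight transfer.
        if fail:
            raise RuntimeError("simulated transfer failure")
        return "ok"

    def cleanup(self):
        self.cleaned_up = True


class DynamicWeightManagerSketch:
    """Illustrative only; the real DynamicWeightManager has many more duties."""

    def __init__(self, transfer_factory=FakeCheckpointTransfer):
        self._transfer_factory = transfer_factory
        self._gdr_handle = None  # nothing is initialized until the first update

    def _get_gdr_handle(self):
        # Lazy initialization: build the handle once, then keep returning it.
        if self._gdr_handle is None:
            self._gdr_handle = self._transfer_factory()
        return self._gdr_handle

    def _destroy_gdr_handle(self):
        # Reset the cache first so a failing cleanup cannot leave a stale handle.
        handle, self._gdr_handle = self._gdr_handle, None
        if handle is not None:
            handle.cleanup()

    def update_weights_by_gdr(self, fail=False):
        handle = self._get_gdr_handle()
        try:
            return handle.receive(fail=fail)
        except Exception:
            # A failed update may leave the handle in an unknown state:
            # drop it so the next call starts from a fresh one.
            self._destroy_gdr_handle()
            raise
```

In this sketch, two consecutive successful updates construct a single handle, while a failed update destroys it and lets the next call rebuild it.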

Usage or Command

No new user-facing command. Existing GDR weight update flow is unchanged.

Accuracy Tests

Not applicable. This PR only changes checkpoint-transfer handle initialization behavior and does not affect model outputs.

Checklist

  • Add at least one tag in the PR title.
  • Format your code, run pre-commit before commit.
  • Add unit tests. No unit tests added because this is a handle lifecycle optimization for GDR runtime behavior.
  • Provide accuracy results. Not applicable; no model output changes.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.


@jackyYang6 jackyYang6 force-pushed the jacky/optimize-checkpoint-transfer-handle-init-develop branch from ee3f166 to b69ad2a Compare June 25, 2026 11:47

codecov-commenter commented Jun 25, 2026


Codecov Report

❌ Patch coverage is 86.95652% with 3 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@6d9a8f4). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| fastdeploy/rl/dynamic_weight_manager.py | 86.95% | 2 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #8078   +/-   ##
==========================================
  Coverage           ?   67.52%           
==========================================
  Files              ?      475           
  Lines              ?    66907           
  Branches           ?    10317           
==========================================
  Hits               ?    45182           
  Misses             ?    18857           
  Partials           ?     2868           

| Flag | Coverage Δ |
|---|---|
| GPU | 77.55% <86.95%> (?) |
| XPU | 6.95% <0.00%> (?) |

@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-06-26 11:04:01

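📋 Review Summary

PR overview: Cache the GDR CheckpointTransfer handle so that dynamic weight updates do not re-initialize the transfer handle each time.
Changed files: fastdeploy/rl/dynamic_weight_manager.py, tests/rl/test_dynamic_weight_gdr.py
Impact tag: [RL]

Issues

No new blocking issues found. PR convention issues are reported in the section below and are not repeated here.

Status of Previous Findings

| Finding | Issue | Status |
|---|---|---|
| F1 | _destroy_gdr_handle() swallows exceptions from cleanup() without logging anything. | ⚠️ Still present |
| F2 | The cached GDR CheckpointTransfer is not released on the sleep/clear weight paths. | ⚠️ Still present |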
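
One possible way to address F1 and F2 is sketched below. It assumes only what the findings state, namely that a `_destroy_gdr_handle()` method calls `cleanup()` on a cached handle. The class name, the logger name, and the `clear_weights` method are hypothetical placeholders for whatever the sleep/clear paths are actually called in `dynamic_weight_manager.py`.

```python
import logging

logger = logging.getLogger("gdr_handle_sketch")  # hypothetical logger name


class HandleOwnerSketch:
    """Illustrative owner of a cached GDR handle; not the real manager class."""

    def __init__(self):
        self._gdr_handle = None

    def _destroy_gdr_handle(self):
        # Drop the cached reference first, so it is reset even if cleanup fails.
        handle, self._gdr_handle = self._gdr_handle, None
        if handle is None:
            return
        try:
            handle.cleanup()
        except Exception:
            # F1: keep the failure non-fatal, but leave a trace with the traceback.
            logger.exception("GDR CheckpointTransfer cleanup failed; handle dropped")

    def clear_weights(self):
        # F2: hypothetical sleep/clear entry point. Release the cached handle
        # before the weights it refers to are freed.
        self._destroy_gdr_handle()
        # ... the actual weight release would follow here ...
```

Whether cleanup failures should be logged and suppressed, as here, or re-raised is a design decision for the PR author; the sketch only shows that the exception need not vanish silently.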

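📝 PR Convention Check

Compliant. The title uses the official [RL] tag, and the PR description contains the Motivation, Modifications, Usage or Command, Accuracy Tests, and Checklist sections required by checklist §D2.

Overall Assessment

This round traced, in risk-priority order, GDR handle creation, reuse, and failure cleanup, the runner update/clear/sleep call chains, and the newly added unit tests. Apart from the unresolved previous findings, no new issues requiring inline comments were found.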

PaddlePaddle-bot commented Jun 26, 2026


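🤖 Paddle-CI-Agent | ci_status_monitor | 2026-07-03 01:28:50 UTC+08:00

This CI report was generated from the following code (refreshed every 30 minutes):
PR commit: 9238c11 | Merge base: 6d9a8f4 (branch: develop)


1 Required jobs: 8/10 passed

| Total runs (reruns) | Total jobs | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped |
|---|---|---|---|---|---|---|
| 42(0) | 42 | 39 | 3 | 0 | 0 | 0 |

| Job | Error type | Confidence | Log |
|---|---|---|---|
| Approval | Approval required | High | Job |
| xpu_8cards_case_test / run_xpu_8cards_cases | Flaky failure | Medium | Job |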

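2 Failure Details

🔴 Approval: Approval required (confidence: high)

Analyzer: builtin

  • Root cause summary: This job requires manual approval.

Suggested fix:

  1. Have a maintainer approve the job; CI continues only after approval.

Related changes: not applicable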

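🔴 xpu_8cards_case_test / run_xpu_8cards_cases: flaky failure (confidence: medium)

Analyzer: generic analysis (fallback)

Failed cases:

| Case | Error summary |
|---|---|
| tests/xpu_ci/8cards_cases/test_pd_21b_ep4tp1.py::test_pd_separation | After the PD-disaggregation EP4TP1 service passed its health check, the first request returned empty content and did not match the expected keywords |

Key logs (translated):

Service health check in progress... P node status code: 200, D node status code: 200
PD-disaggregation service started successfully! Took 10 seconds
Model reply:
PD-disaggregation test failed: response content does not match expectations:
assert False
E   AssertionError: response content does not match expectations:
FAILED tests/xpu_ci/8cards_cases/test_pd_21b_ep4tp1.py::test_pd_separation
  • Root cause summary: The first XPU PD-disaggregation request returned empty content.
    The logs show that Router, Prefill, and Decode all started and that the health checks returned 200, but response.choices[0].message.content was empty, so the keyword assertion at tests/xpu_ci/8cards_cases/test_pd_21b_ep4tp1.py:306 failed. Later PD-disaggregation cases in the same job (EP4TP4, CudaGraph, MTP) passed, and FD_USE_GDR_CHECKPOINT_TRANSFER does not appear in the logs. The PR only changes the CheckpointTransfer handle reuse logic for RL dynamic weight updates, so the current evidence is insufficient to attribute the failure to the PR's code.

Suggested fix:

  1. This is a known flaky case; please rerun. If the failure reproduces, investigate the empty first reply in XPU PD-disaggregation EP4TP1 and the RDMA/cache-transfer server-side logs.

Related changes: The PR modifies fastdeploy/rl/dynamic_weight_manager.py and tests/rl/test_dynamic_weight_gdr.py; no direct call-chain link to the failed case was found.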


3 participants